Introduction

Coronavirus disease(COVID19) is an infectious disease caused by a newly discovered coronavirus. It has spread to numerous countries across all continent since the first discovery in Wuhan, China back in Nov 2019 and was declared as pandemic by WHO on March 11.

Various countries has came out measure/restrictions to respond to COVID-19. Since “circuit breaker”, a partial nationwide lockdown, where only essential services could allow opened, Singapore(SG) residents have started to feel a great impact on daily life where they are encouraged to stay home as much as possible and wearing of mask became mandatory when going out. SG government has constantly revising policies and social restrictions. Three phases of planned reopening were announced on 19 May namely “Safe Reopening” (Phase1) “Safer Transition” (Phase2), and finally “Safe Nation” (Phase3).

Problem Statement

Microblogging has become one of the most useful tools for sharing everyday life events and news and for expressing opinions about those events. As Twitter posts are short and constantly being generated, they are a great source for providing public sentiment towards events that occurred throughout the COVID-19 period in Singapore.

In our Capstone Project, we perform exploratory data analysis about SG COVID situation and sentiment analysis and modeling on the tweets about COVID19 to seek to answer the following research questions:

  1. What are the main prevalent sentiment and emotions expressed in words in Singapore tweets about current COVID situation?

  2. Is there any change of sentiment over a period of time amidst global reopening with higher vaccination rate, in contrast growing new daily cases/death locally?

For our data science project, we activated the following packages, using the Tidyverse approach.

# Load necessary packages
pacman::p_load(tidyverse, broom, modelr, lubridate, 
               tidytext, wordcloud2, wordcloud, reshape2,
               textdata,   # Employing Lexicon
               jtools, huxtable, gridExtra,
               psych, Hmisc, car, sandwich, 
               ggthemes, scales, ggstance, 
               gvlma, gtrendsR, rtweet, glue, ggplot2, quantmod,gifski,
               caret, DT, dplyr)

my_colors <- c("#05A4C0", "#85CEDA", "#D2A7D8", "#A67BC5", "#BB1C8B", "#8D266E", "gold4", "darkred", "deepskyblue4")

my_theme <- theme(plot.background = element_rect(fill = "grey98", color = "grey20"),
                  panel.background = element_rect(fill = "grey98"),
                  panel.grid.major = element_line(colour = "grey87"),
                  text = element_text(color = "grey20"),
                  plot.title = element_text(size = 22),
                  plot.subtitle = element_text(size = 17),
                  axis.title = element_text(size = 15),
                  axis.text = element_text(size = 15),
                  legend.box.background = element_rect(color = "grey20", fill = "grey98", size = 0.1),
                  legend.box.margin = margin(t = 3, r = 3, b = 3, l = 3),
                  legend.title = element_blank(),
                  legend.text = element_text(size = 15),
                  strip.text = element_text(size=17))

Import

Then, we imported our dataset.

Data Source 1: SG COVID DATA

SSA <- read.csv("Covid-19.csv")

The dataset contains time series data on covid-19 cases in Singapore on counts of confirmed, discharged, hospitalised, deaths, imported cases.

Within the dataset, Some of few key fields we are interested in reflecting current Singapore COVID-19 situation, namely Daily.Confirmed, Daily Deaths, Still.Hospitalised and Intensive.Care.Unit..ICU which will also serve as our dependent (discrete x) variable.

Notes:

  • All figures (excluding variables with names containing “MOH report”) are as at press release within the day and are not back-dated to update any changes that the Ministry of Health (MOH) might have made.

  • “Daily imported” and “Daily Local transmission” make up “Daily Confirmed”.

  • Still Hospitalised" is computed based on “Total Confirmed” - “Cumulative Discharged” - “Discharged to Isolation” - “Cumulative Deaths” - “Tested positive demise” (summed). This might not tally with the sum of “General Wards MOH report” and “Intensive Care Unit (ICU)”, indicating dirty data.

  • “Cumulative Vaccine Doses”, “Cumulative Individuals Vaccinated”, and “Cumulative Individuals Vaccination Completed” added from 1 Jul 2021. The data is tied to the date of the report but the count is as of the previous day. i.e. Figures indicated for 1 Jul 2021 reflect the total number of doses of COVID-19 vaccines as of Jun 30. “Cumulative Individuals Vaccinated” refers to the number of individuals who have received at least one dose of vaccine.

glimpse(SSA)
## Rows: 999
## Columns: 37
## $ Date                                                          <chr> "2020-01…
## $ Daily.Confirmed                                               <int> 1, 2, 1,…
## $ False.Positives.Found                                         <int> NA, NA, …
## $ Cumulative.Confirmed                                          <int> 1, 3, 4,…
## $ Daily.Discharged                                              <int> 0, 0, 0,…
## $ Passed.but.not.due.to.COVID                                   <int> 0, 0, 0,…
## $ Cumulative.Discharged                                         <int> 0, 0, 0,…
## $ Discharged.to.Isolation                                       <int> 0, 0, 0,…
## $ Still.Hospitalised                                            <int> 1, 3, 4,…
## $ Daily.Deaths                                                  <int> 0, 0, 0,…
## $ Cumulative.Deaths                                             <int> 0, 0, 0,…
## $ Tested.positive.demise                                        <int> 0, 0, 0,…
## $ Daily.Imported                                                <int> 1, 2, 1,…
## $ Daily.Local.transmission                                      <int> 0, 0, 0,…
## $ Local.cases.residing.in.dorms.MOH.report                      <int> NA, NA, …
## $ Local.cases.not.residing.in.doms.MOH.report                   <int> NA, NA, …
## $ Intensive.Care.Unit..ICU.                                     <int> 0, 0, 0,…
## $ General.Wards.MOH.report                                      <int> NA, NA, …
## $ In.Isolation.MOH.report                                       <int> NA, NA, …
## $ Total.Completed.Isolation.MOH.report                          <int> NA, NA, …
## $ Total.Hospital.Discharged.MOH.report                          <int> NA, NA, …
## $ Requires.Oxygen.Supplementation.or.Unstable                   <int> NA, NA, …
## $ Linked.community.cases                                        <int> NA, NA, …
## $ Unlinked.community.cases                                      <int> NA, NA, …
## $ Phase                                                         <chr> "", "", …
## $ Cumulative.Vaccine.Doses                                      <int> NA, NA, …
## $ Cumulative.Individuals.Vaccinated                             <int> NA, NA, …
## $ Cumulative.Individuals.Vaccination.Completed                  <int> NA, NA, …
## $ Perc.population.completed.at.least.one.dose                   <chr> "", "", …
## $ Perc.population.completed.vaccination                         <chr> "", "", …
## $ Sinovac.vaccine.doses                                         <int> NA, NA, …
## $ Cumulative.individuals.using.Sinovac.vaccine                  <int> NA, NA, …
## $ Doses.of.other.vaccines.recognised.by.WHO                     <int> NA, NA, …
## $ Cumulative.individuals.using.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Number.taken.booster.shots                                    <int> NA, NA, …
## $ Perc.population.taken.booster.shots                           <chr> "", "", …
## $ X                                                             <lgl> NA, NA, …

Data Source 1: Tidy & Transform

The first thing we did with our loaded dataset was to remove the non-relevant columns and remain only those we will perform analysis and modeling. Also, we transformed the dataset into long format for data exploratory visualization.

SSA1<- tibble(SSA)

SSA1 <- SSA1[-(1:626) , -(18:37)]
SSA1 <- SSA1[ , -(11:16)]  
SSA1 <- SSA1[ , -(3:8)]
SSA1 <- SSA1[-(35:373) , ]

SSA1$Date <- as.Date(SSA1$Date)

#This illustrate a comparison of the daily cases for Death, Confirmed, Hospitalised and ICU over the study period

SSA_chart <- SSA1 %>% pivot_longer(cols = Daily.Confirmed:Intensive.Care.Unit..ICU. , 
                                    names_to = "Cases", 
                                    values_to = "Value")


COLORS <- c(Daily.Confirmed = "#c381fd", Daily.Death ="#4815aa",  
            Still.Hospitalised = "#f2626b" , Intensive.Care.Unit..ICU. = "#feba4f")

ggplot(SSA_chart, aes(x = Date, y = Value, group = Cases, color = Cases)) +
  geom_line(size = 0.9) +
  scale_color_manual(values = COLORS)+
  scale_y_continuous("Cases", limits = c(0,5500)) + 
  labs(title="Comparison of Daily Cases\nfor Death, Confirmed, Hospitalised and ICU")+
  theme(legend.title = element_text(color = "blue", size = 10)) +
  my_theme + theme(axis.title.x = element_blank(),
                   legend.position = "bottom") +
  scale_x_date(date_breaks = "1 day") + 
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) 

Data Source 2: SG TWEETER DATA

# We observed 7-days data usually capped below 3000 tweets per extraction.
# sg_tweets <- search_tweets(q = "#coronavirus OR #covid19 OR #COVID OR #stayhome OR #Covid-19 OR #pandemic OR #virus OR #social-distance OR #self-quarantine OR $swab-test OR #PCR OR #infection-rate", 
#                                         n = 3000, 
#                                         lang = "en",
#                                         include_rts = F,
#                                         geocode = lookup_coords("singapore")
                                
sg_tweets <- read.csv("covid19_sg_tweeter_1010_1115.csv")

Let’s explore our tweets data with regards to COVID-19 since our first extraction on 18th October to understand sentiment after recent sharp rise in number of local cases and death since end-September.

We also identified 2 key events over the period to analyse further to answer our research question if the event will have significance influence on the public sentiment.

2021-10-20

  • PM Lee’s address on COVID-19 situation
  • Announcement on the extension of the Stabilisation Phase for four weeks, through to 21 November 2021.
  • Unvaccinated people can no longer eat at hawker centres, enter shopping malls.

2021-11-08

  • Allow up to five fully vaccinated persons from the same household to dine-in together at food and beverage (F&B) establishments
  • Loose restrictions on sports and selected MICE (Meetings, Incentives, Conferences and Exhibitions) events.
  • Resuming more activities in schools, in preparation for the larger-scale safe resumption of co-curricular learning activities in the coming school year.
  • Adjusting border measures and extending Vaccinated Travel Lanes(VTL) to Malaysia, Finland and Sweden.
# Basic EDA on amount of tweet in time (ALL)
options(repr.plot.width=20, repr.plot.height=9)

sg_tweets %>% 
  select(created_at) %>% 
  mutate(date = ymd(as.Date(created_at))) %>% 
  group_by(date) %>% 
  summarise(n = n(), .groups = "drop_last") %>%
  ggplot(aes(x=date, y = n)) + 
  geom_line(size = 1, color = my_colors[1]) +
  coord_cartesian(clip = 'off') +
  geom_vline(xintercept = as.Date('2021-10-20'), linetype="dotted", size = 1.5, color = "red") +
  geom_vline(xintercept = as.Date('2021-11-08'), linetype="dotted", size = 1.5, color = "darkblue") +
  my_theme + theme(axis.title.x = element_blank()) +
  scale_x_date(date_breaks = "1 day") + 
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) +
  labs(title = "Number of COVID-19 Tweets shared between 10th Oct - 15th Nov", subtitle = "Number of tweets spiked on key events") +
    geom_label(aes(x=as.Date('2021-10-19'), y = 380, label = "PM Lee's address on COVID-19"), color = "red", size = 4, angle = 90, fontface = "bold") +
    geom_label(aes(x=as.Date('2021-11-07'), y = 380, label = "More Opening on COVID-19 restrictions"  ), color = "darkblue", size = 4, angle = 90, fontface = "bold") 

Data Source 2: Tidy & Transform

# Step 1: Tokenization ----
sg_tweets_id <- sg_tweets %>% 
  mutate(created_at = as.Date(sg_tweets$created_at)) %>% 
  rowid_to_column("id")

tidy_tweets_token <- sg_tweets_id %>%
  drop_na(text) %>% 
  select(id, created_at, status_id, text) %>% 
  filter(text != "") %>% 
  unnest_tokens(word, text, token = "tweets") 

# Step 2: Cleaning ----
tweets_cleaned <- tidy_tweets_token %>%
  anti_join(tidytext::stop_words)

# Manual cleaning, filtering out unwanted words
tweets_token_cleaned <- tweets_cleaned %>%
  filter(!word %in% c("singapore", "covid", "covid19","positive","negative","oct","nov","news","amp","reuters","news","daily","malaysia","november","october","october","press","journal","amid","weekly","days","weeks","china","chinese","report","am","pm","dont","taking","found","morning","bloomberg","months","month","india","media","week","read","reports","data","europe","monday","tuesday","wednesday","thursday","friday","satursday","sunday","wall","street") & !str_detect(word,"^#|^@") & !str_detect(word, "^[:digit:]+$"))

Visualisation for Basic Exploratory Data Analysis

A Simple Word Cloud

covid_tweets_for_wc <- tweets_token_cleaned %>% 
  group_by(word) %>% 
  summarise(frequency = n()) %>% 
  arrange(desc(frequency))

covid_tweets_for_wc %>% 
  filter(frequency > 3) %>% 
  wordcloud2(backgroundColor = "black", 
             color = "random-light")

Word Cloud (Positive vs Negative)

# A Postive-Negative Word Cloud by using BING
BING <- get_sentiments("bing")

tweets_token_cleaned %>% 
  inner_join(BING, by="word") %>%
  count(word, sentiment, sort=T) %>% 
  acast(word ~ sentiment, value.var = "n", fill=0) %>% 
  comparison.cloud(colors=my_colors[c(5, 1)], max.words = 400, title.size = 2,
                   scale = c(3,.5))

Top 3 Most Negative Tweets in the dataset

AFINN <- get_sentiments("afinn")

## TOP 3 MOST NEGATIVE TWEET ----
tweets_AFINN_indexed <- tweets_token_cleaned %>% 
  inner_join(AFINN, by = "word")

tweet_level_sentiment <- tweets_AFINN_indexed %>% 
  group_by(id) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()
  )

top3_negative <- tweet_level_sentiment %>% 
  arrange(average_sentiment) %>% 
  head(3) 

sg_tweets_id %>% 
  filter(id %in% top3_negative$id ) %>% 
  select(text) %>% 
  pull()
## [1] "'They don't choose to have it': Why Covid-19 is hell for people with OCD"                                                                                                      
## [2] "Asked my mum to get me donuts 🍩 I’ve no idea why i did that when I can’t taste no shit. Arghh but just cravings 😫 fuck covid 19, fuck the monthly cycle, fuck everything! 😖"
## [3] "deadass my eyes went from covid 19 then positive bitch"

Top 3 Most Positive Tweets in the dataset

# TOP 3 MOST POSITIVE TWEETS ----
top3_positive <- tweet_level_sentiment %>% 
  arrange(desc(average_sentiment)) %>% 
  head(3)

sg_tweets_id %>% 
  filter(id %in% top3_positive$id) %>% 
  select(text) %>% 
  pull()
## [1] "…or else they find alternative employment. They are also bringing in a vax passport to enter some venues buildings etc, unless there is a current Covid-19 test (often valid only for a day or 2). It's basically becoming a no vax, no job, &amp; no fun situation. This I approve."                                                  
## [2] "“MOH said the doctor had no known medical conditions and was partially vaccinated with a non-mRNA Covid-19 vaccine under the Special Access Route.”\n\nWow, this has to be stated so explicitly. I wonder why? \n\n⁦⁩ ⁦@Huigoon⁩"                                                                                                          
## [3] "@DavidBieleski @bergeron_brent @katiehasedits This trend was evident on 9Aug 2020, Back then out of the 3, SG had the most cases with 55,104, whilst NZ, the least with 1,219. Even back then Greece the highest with 213 COVID-19 related deaths.\n\nFast forward to 5 Nov 2021, Greece is the winner in Covid-19 cases &amp; deaths."

Overall Emotion Analysis

Distribution Breakdown by Emotion Class using NRC technique

NRC <- get_sentiments("nrc")

options(repr.plot.width=15, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>% 
  count(sentiment, sort=T) %>% 
  ggplot(aes(x=reorder(sentiment, n), y=n)) +
  geom_bar(stat="identity", aes(fill=n), show.legend=F) +
  geom_label(aes(label=format(n, big.mark = ",")), size=5, fill="white") +
  labs(x="Sentiment", y="Frequency", title="What is the overall mood in Tweets?") +
  scale_fill_gradient(low = my_colors[3], high = my_colors[1], guide="none") +
  coord_flip() + 
  my_theme + theme(axis.text.x = element_blank())

Most Fequent Words by Emotion Class

#options(repr.plot.width=25, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>% 
  count(sentiment, word, sort=T) %>%
  group_by(sentiment) %>% 
  arrange(desc(n)) %>% 
  slice(1:7) %>% 
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend = F) +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 5) +
  coord_flip() +
  my_theme + theme(axis.text.x = element_blank()) +
  labs(x="Word", y="Frequency", title="Sentiment split by most frequent words") +
  scale_fill_manual(values = c(my_colors, "#BE82AF", "#9D4387", "#DEC0D7",
                               "#40BDC8", "#80D3DB", "#BFE9ED"))

Research on Influence from Singapore PM’s address

20 Oct 2021

Here, we are interested in a research question: Did COVID-19 key events within our sentiment analysis period on 20 Oct occured below changed the public sentiment or gain more trust in effects with the leadership?

  1. PM’s address the nation on COVID-19 situation in Singapore:
  • The path to a “New Normal”, diverted from original zero COVID approach and to live with COVID19.
  • local cases spiked sharply over the past few weeks
  • Asked for unity and COVID resilience.
  1. Announcement on COVID social curbs to be extended another month to 21 Nov. originally slated to be in place until 24 Oct.
  • Dining out to 2 people
  • Work from home remains the default.

We are going to use Regression Discontinuity Analysis on the causal inference and effect.

Firstly, we explore the data with 10 days before and after PM Lee’s address, assuming date close to the cut off on 20 Oct has more relevant effects.

Data Overview

Using AFINN technique

sentiment_daily <- tweets_AFINN_indexed %>% 
  group_by(created_at) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()) 

  # Plot
sentiment_daily %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) %>% 
  ggplot(aes(x = created_at, y = average_sentiment) ) +
  geom_point(size = 2, color = my_colors[1]) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1, linetype="dotdash", color = my_colors[6]) +
  scale_x_date(date_breaks = "1 day") + 
  ggtitle("Distribution of Average Sentiment \n10 days before & after PM address") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 ))

Using NRC technique for Emotion Classification

# Extract for analysis period
 tweets_token_analysis_period <- tweets_token_cleaned %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) 

classified_sentiment <- tweets_token_analysis_period %>% 
  inner_join(NRC, "word") %>% 
  group_by(sentiment, created_at) %>% 
  summarise(cnt = n()) 

# Plot Chart
classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>% 
  ggplot(aes(x=created_at, y =cnt, color = sentiment)) +
  geom_point() +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 4) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8]) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d") +
    theme(axis.text.x = element_text(angle = 45, size = rel(0.8), vjust = 1, hjust = 1 )) +
  labs(y="Count of Emotional Words", x="Period of Date")

Using Radar Chart, another visualisation chart.

# Data transformation
char_sentiment <- classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(sentiment, covid_address_effect) %>% 
  summarise(char_sentiment_count = sum(cnt)) 

total_char <-   classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(covid_address_effect) %>% 
  summarise(total = sum(cnt))

# Plot Chart
char_sentiment %>% 
  inner_join(total_char, by="covid_address_effect") %>% 
  mutate(percent = char_sentiment_count / total * 100 ) %>% 
  select(-char_sentiment_count, -total) %>% 
  spread(covid_address_effect, percent)  %>% 
  radarchart::chartJSRadar(showToolTipLabel = T, main="Any Effects on the Emotion Class Percentage After Address? - No", maxScale=25, responsive=T,addDots = T, colMatrix = grDevices::col2rgb(my_colors[c(2,4)]),lineAlpha = 0.7, polyAlpha =0.2)

Simple Linear Regression

OLS Linear Regression on average sentiment over time period

merged_dataset_RDD <- SSA1 %>% 
  inner_join(sentiment_daily, by = c("Date" = "created_at")) %>% 
  filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-10-30')) 

# add dummy variable for pre-effect = 0, and post-effect = 1
merged_dataset_RDD <- merged_dataset_RDD %>% 
  mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))

merged_dataset_RDD %>% 
  lm(average_sentiment ~ covid_address_effect + I(Date - as.Date('2021-10-20')),.) %>% 
  summary()
## 
## Call:
## lm(formula = average_sentiment ~ covid_address_effect + I(Date - 
##     as.Date("2021-10-20")), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36397 -0.09578  0.01303  0.08805  0.32045 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.340705   0.080718  -4.221 0.000514 ***
## covid_address_effectBefore      -0.238246   0.150119  -1.587 0.129913    
## I(Date - as.Date("2021-10-20")) -0.007743   0.012382  -0.625 0.539602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1718 on 18 degrees of freedom
## Multiple R-squared:  0.2093, Adjusted R-squared:  0.1214 
## F-statistic: 2.382 on 2 and 18 DF,  p-value: 0.1209
merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment)) +
  geom_point(aes(color = covid_address_effect)) + 
  geom_smooth(method = "lm") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  ggthemes::theme_fivethirtyeight() +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text(), legend.position = "bottom") +
  labs(title="OLS Simple Regression Model")

### Regression Discontinuity Analysis

We perform Regression Discontinuity Analysis on the effects of PM address event.

We expect to observe there is a high volume on Oct. 20, jump in sentiment score/count on Oct. 19 and Oct. 21

# RDD
RDD <- merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Regression Discontinuity Analysis") +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text()) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

# Difference in Means Test
RDD_box <- merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_boxplot(outlier.colour="black",
               outlier.size=2, notch=FALSE) + 
  geom_point() +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Test for Significant Difference") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

gridExtra::grid.arrange(RDD, RDD_box, ncol=2)

Perform T-test to find significance in difference between 2 groups (Before and After PM address)

# Conduct a difference of means test
# Hypothesis: H0 : mean of pre-address_effect = mean of post-address_effect
merged_dataset_RDD %>%
  t.test(average_sentiment ~ covid_address_effect, .)
## 
##  Welch Two Sample t-test
## 
## data:  average_sentiment by covid_address_effect
## t = 2.1578, df = 18.219, p-value = 0.04453
## alternative hypothesis: true difference in means between group After and group Before is not equal to 0
## 95 percent confidence interval:
##  0.004272338 0.309624417
## sample estimates:
##  mean in group After mean in group Before 
##           -0.3794175           -0.5363659

Model

For the preparation of the model, we created and ran a correlational matrix, to see how our variables of interest (within the model) are related.

pacman::p_load(Hmisc, broom, DT)

merged_dataset <- SSA1 %>% 
  inner_join(sentiment_daily, by = c("Date" = "created_at")) %>% 
  filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-11-12')) %>% 
  mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))

merged_dataset$Date <- as.Date(merged_dataset$Date)

merged_dataset <- merged_dataset[ , -(7:8)]
#Getting Summary
merged_dataset %>% 
  select("average_sentiment", "Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU.") %>% 
  summary(.)
##  average_sentiment Daily.Confirmed  Daily.Deaths   Still.Hospitalised
##  Min.   :-0.8120   Min.   :1767    Min.   : 6.00   Min.   :1434      
##  1st Qu.:-0.5179   1st Qu.:2943    1st Qu.: 9.00   1st Qu.:1584      
##  Median :-0.4315   Median :3182    Median :12.00   Median :1640      
##  Mean   :-0.4041   Mean   :3206    Mean   :12.03   Mean   :1633      
##  3rd Qu.:-0.2725   3rd Qu.:3472    3rd Qu.:14.75   3rd Qu.:1686      
##  Max.   : 0.1050   Max.   :5324    Max.   :18.00   Max.   :1757      
##  Intensive.Care.Unit..ICU.
##  Min.   :41.00            
##  1st Qu.:58.25            
##  Median :64.00            
##  Mean   :61.88            
##  3rd Qu.:68.50            
##  Max.   :75.00
#For the preparation of the model, we created and ran a correlational matrix, 
#to see how our variables of interest (within the model) are related.

merged_dataset %>% 
  select(average_sentiment, Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
  as.matrix(.) %>% 
  rcorr(.) %>% 
  tidy(.) %>% 
  rename(variable_1 = column1,
         variable_2 = column2,
         corr = estimate) %>% 
  mutate(abs_corr = abs(corr)
  ) %>% 
  datatable(options = list(scrollX = T),
  ) %>% 
  formatRound(columns = c("corr", "p.value", "abs_corr"), 
              digits = 3)
#It is worth noting that Intensive.Care Unit and Still Hospitalized has the highest correlation

It is worth noting that Intensive Care Unit and Still Hospitalized has the highest correlation from the result.

We will use Merged Dataset for predicting average sentiment based on daily ICU, Confirmed, Hospitalised and Daily Deaths cases. We’ll randomly split the data into training set (70% for building a predictive model) and test set (30% for evaluating the model). Make sure to set seed for reproducibility.

# Split the data into training and test set
set.seed(1234)
training.samples <- merged_dataset$average_sentiment %>%
  createDataPartition(p = 0.7, list = FALSE)
train.data  <- merged_dataset[training.samples, ]
test.data <- merged_dataset[-training.samples, ]


#Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using ~.:
model <- lm(average_sentiment ~., data = train.data)
summary(model)$coef
##                                Estimate   Std. Error    t value  Pr(>|t|)
## (Intercept)               -3.617841e+01 1.288802e+02 -0.2807136 0.7818134
## Date                       1.854779e-03 6.826509e-03  0.2717024 0.7886358
## Daily.Confirmed           -7.959924e-05 5.810620e-05 -1.3698924 0.1859048
## Still.Hospitalised         2.180290e-04 7.332021e-04  0.2973655 0.7692541
## Daily.Deaths               1.040059e-02 1.286043e-02  0.8087276 0.4281879
## Intensive.Care.Unit..ICU.  7.192509e-03 6.900859e-03  1.0422629 0.3097256
summary(model)
## 
## Call:
## lm(formula = average_sentiment ~ ., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30303 -0.12142 -0.03604  0.13835  0.34715 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -3.618e+01  1.289e+02  -0.281    0.782
## Date                       1.855e-03  6.827e-03   0.272    0.789
## Daily.Confirmed           -7.960e-05  5.811e-05  -1.370    0.186
## Still.Hospitalised         2.180e-04  7.332e-04   0.297    0.769
## Daily.Deaths               1.040e-02  1.286e-02   0.809    0.428
## Intensive.Care.Unit..ICU.  7.193e-03  6.901e-03   1.042    0.310
## 
## Residual standard error: 0.1907 on 20 degrees of freedom
## Multiple R-squared:  0.2777, Adjusted R-squared:  0.09712 
## F-statistic: 1.538 on 5 and 20 DF,  p-value: 0.2231
# Summarize the model
summary(model)
## 
## Call:
## lm(formula = average_sentiment ~ ., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30303 -0.12142 -0.03604  0.13835  0.34715 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -3.618e+01  1.289e+02  -0.281    0.782
## Date                       1.855e-03  6.827e-03   0.272    0.789
## Daily.Confirmed           -7.960e-05  5.811e-05  -1.370    0.186
## Still.Hospitalised         2.180e-04  7.332e-04   0.297    0.769
## Daily.Deaths               1.040e-02  1.286e-02   0.809    0.428
## Intensive.Care.Unit..ICU.  7.193e-03  6.901e-03   1.042    0.310
## 
## Residual standard error: 0.1907 on 20 degrees of freedom
## Multiple R-squared:  0.2777, Adjusted R-squared:  0.09712 
## F-statistic: 1.538 on 5 and 20 DF,  p-value: 0.2231
# Make predictions
predictions <- model %>% predict(test.data)
RMSE(predictions, test.data$average_sentiment)
## [1] 0.2429985
# (b) R-square
R2(predictions, test.data$average_sentiment)
## [1] 0.002197841

Specify OLS model

We ran two regression models. The first regressed daily cases onto average sentiment (model1).

\[ \begin{eqnarray} \widehat{swl} = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4Intensive.Care.Unit..ICU.+ \epsilon \end{eqnarray} \]

Our key investigation lies in the next model, in which we regressed daily cases orientations, along with interaction terms, onto average sentiment (model2).

\[ \begin{eqnarray} \widehat{swl} = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4ICU + \\ + b_5Daily.Deaths \times icu + b_6Daily.Confirmed \times icu + b_7Still.Hospitalised \times icu + \epsilon \end{eqnarray} \]

model1 <- lm(average_sentiment~  Daily.Deaths + Daily.Confirmed +Still.Hospitalised + Intensive.Care.Unit..ICU., 
             train.data)

tidy(model1) %>% as_tibble()
termestimatestd.errorstatisticp.value
(Intercept)-1.16    1.09    -1.07 0.296 
Daily.Deaths0.0119  0.0113  1.06 0.302 
Daily.Confirmed-8.44e-055.42e-05-1.56 0.134 
Still.Hospitalised0.0002190.0007170.3060.763 
Intensive.Care.Unit..ICU.0.00852 0.00475 1.79 0.0873
glance(model1)
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.2750.1370.1861.990.13349.56-7.110.4370.732126
model2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU., 
             train.data)

tidy(model2) %>% as_tibble()
termestimatestd.errorstatisticp.value
(Intercept)8.58    8.54    1.01 0.328
Daily.Confirmed0.0004290.0005760.7450.466
Daily.Deaths-0.0478  0.0982  -0.4860.633
Still.Hospitalised-0.00625 0.00497 -1.26 0.225
Intensive.Care.Unit..ICU.-0.171   0.146   -1.17 0.256
Daily.Confirmed:Intensive.Care.Unit..ICU.-8.35e-068.95e-06-0.9330.363
Daily.Deaths:Intensive.Care.Unit..ICU.0.0009640.00151 0.6380.531
Still.Hospitalised:Intensive.Care.Unit..ICU.0.0001188.64e-051.36 0.189
glance(model2)
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.3670.1210.1881.490.233711.3-4.646.680.6371826
summary(model1)
## 
## Call:
## lm(formula = average_sentiment ~ Daily.Deaths + Daily.Confirmed + 
##     Still.Hospitalised + Intensive.Care.Unit..ICU., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29454 -0.12106 -0.02985  0.13562  0.34812 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)               -1.163e+00  1.085e+00  -1.071   0.2962  
## Daily.Deaths               1.194e-02  1.129e-02   1.058   0.3022  
## Daily.Confirmed           -8.437e-05  5.416e-05  -1.558   0.1342  
## Still.Hospitalised         2.193e-04  7.168e-04   0.306   0.7627  
## Intensive.Care.Unit..ICU.  8.523e-03  4.753e-03   1.793   0.0873 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1864 on 21 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.1369 
## F-statistic: 1.992 on 4 and 21 DF,  p-value: 0.1328
summary(model2)
## 
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths + 
##     Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26312 -0.11455 -0.02999  0.12202  0.29962 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   8.581e+00  8.537e+00   1.005
## Daily.Confirmed                               4.289e-04  5.757e-04   0.745
## Daily.Deaths                                 -4.778e-02  9.822e-02  -0.486
## Still.Hospitalised                           -6.247e-03  4.972e-03  -1.257
## Intensive.Care.Unit..ICU.                    -1.712e-01  1.458e-01  -1.174
## Daily.Confirmed:Intensive.Care.Unit..ICU.    -8.350e-06  8.952e-06  -0.933
## Daily.Deaths:Intensive.Care.Unit..ICU.        9.637e-04  1.510e-03   0.638
## Still.Hospitalised:Intensive.Care.Unit..ICU.  1.178e-04  8.638e-05   1.364
##                                              Pr(>|t|)
## (Intercept)                                     0.328
## Daily.Confirmed                                 0.466
## Daily.Deaths                                    0.633
## Still.Hospitalised                              0.225
## Intensive.Care.Unit..ICU.                       0.256
## Daily.Confirmed:Intensive.Care.Unit..ICU.       0.363
## Daily.Deaths:Intensive.Care.Unit..ICU.          0.531
## Still.Hospitalised:Intensive.Care.Unit..ICU.    0.189
## 
## Residual standard error: 0.1882 on 18 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.1209 
## F-statistic: 1.491 on 7 and 18 DF,  p-value: 0.2325

We tested if model2, with interaction terms, enhances the explanatory power of the model using anova function.

anova(model1, model2)
Res.DfRSSDfSum of SqFPr(>F)
210.73              
180.63730.09260.8720.474
#The results of the analysis suggest that adding the interaction terms does significantly increases the R-squared of model2, as compared to model1.

The results of the analysis suggest that adding the interaction terms significantly increases the R-squared of model2, as compared to model1.

Assumption Check

Prof. Roh’s Note: “Here, please check the linearity assumption, using Global Validation of Linear Model Assumption (gvlma) package. You may visualize the core infomation of assumption checks, using ggfortify package.”

library(gvlma)
gvlma(model2)
## 
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths + 
##     Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
## 
## Coefficients:
##                                  (Intercept)  
##                                    8.581e+00  
##                              Daily.Confirmed  
##                                    4.289e-04  
##                                 Daily.Deaths  
##                                   -4.778e-02  
##                           Still.Hospitalised  
##                                   -6.247e-03  
##                    Intensive.Care.Unit..ICU.  
##                                   -1.712e-01  
##    Daily.Confirmed:Intensive.Care.Unit..ICU.  
##                                   -8.350e-06  
##       Daily.Deaths:Intensive.Care.Unit..ICU.  
##                                    9.637e-04  
## Still.Hospitalised:Intensive.Care.Unit..ICU.  
##                                    1.178e-04  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model2) 
## 
##                     Value p-value                Decision
## Global Stat        4.2755  0.3700 Assumptions acceptable.
## Skewness           0.4938  0.4822 Assumptions acceptable.
## Kurtosis           0.8413  0.3590 Assumptions acceptable.
## Link Function      1.6721  0.1960 Assumptions acceptable.
## Heteroscedasticity 1.2683  0.2601 Assumptions acceptable.
library(ggthemes)
theme_set(theme_fivethirtyeight())

library(ggfortify)
autoplot(gvlma(model2))

Check Multicollinearity

library(car)
vif(model1);vif(model2)
##              Daily.Deaths           Daily.Confirmed        Still.Hospitalised 
##                  1.057994                  1.039943                  1.398221 
## Intensive.Care.Unit..ICU. 
##                  1.355200
##                              Daily.Confirmed 
##                                    115.35748 
##                                 Daily.Deaths 
##                                     78.64201 
##                           Still.Hospitalised 
##                                     66.02596 
##                    Intensive.Care.Unit..ICU. 
##                                   1251.42553 
##    Daily.Confirmed:Intensive.Care.Unit..ICU. 
##                                    174.31819 
##       Daily.Deaths:Intensive.Care.Unit..ICU. 
##                                    103.30087 
## Still.Hospitalised:Intensive.Care.Unit..ICU. 
##                                   1507.95036

Report the Results with kable in R Markdown

Prof. Roh’s Note: “Now that the assumption check is done, you might want to put the results into a prettier format of table. The default print-out of table in R Markdown does not look good. The knitr package contains a very basic command, kable, which will format an array or data frame more presentable for display. Thus, use the following for your report.”

library(knitr) # Please install the package "knitr" first.
library(kableExtra) # You might want to use package "kableExtra" as well.
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:huxtable':
## 
##     add_footnote
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kable(tidy(model2))%>%
  kable_styling("striped", full_width = T, fixed_thead = T) %>%
  column_spec(c(1, 5), bold = T) %>%
  row_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")
term estimate std.error statistic p.value
(Intercept) 8.5809204 8.5369649 1.0051488 0.3281471
Daily.Confirmed 0.0004289 0.0005757 0.7450662 0.4658496
Daily.Deaths -0.0477788 0.0982195 -0.4864491 0.6325187
Still.Hospitalised -0.0062469 0.0049716 -1.2565095 0.2249941
Intensive.Care.Unit..ICU. -0.1711734 0.1457604 -1.1743475 0.2555592
Daily.Confirmed:Intensive.Care.Unit..ICU. -0.0000084 0.0000090 -0.9327804 0.3632776
Daily.Deaths:Intensive.Care.Unit..ICU. 0.0009637 0.0015100 0.6382359 0.5313570
Still.Hospitalised:Intensive.Care.Unit..ICU. 0.0001178 0.0000864 1.3642178 0.1893146
kable(glance(model2))%>%
  kable_styling("striped", full_width = T, font_size = 12) %>%
  column_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.3670243 0.1208671 0.1881612 1.491016 0.2325214 7 11.31988 -4.639758 6.683111 0.6372836 18 26

The regression analysis came up with two significant interaction terms.

  • First, it appears that the relationships between Daily Intensive Care Unit(ICU) cases regarding average sentiment is different depending on daily confirmed cases.

  • Second, it appears that the relationships between Daily Intensive Care Unit(ICU) cases regarding average sentiment is different depending on daily deaths.

To see the patterns of interaction, we visualized the significant interaction effects on the next chapter.

pacman::p_load(jtools, huxtable, ggstance, interactions)

m1 <- lm(average_sentiment~  Daily.Deaths + Daily.Confirmed +Still.Hospitalised + Intensive.Care.Unit..ICU., 
         train.data)

m2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU., 
         train.data)

export_summs(m1, m2, 
             error_format = "(t = {statistic}, p = {p.value})",
             align = "right",
             model.names = c("Main Effects Only", "with Interactions"),
             digits = 3)
Main Effects Onlywith Interactions
(Intercept)-1.1638.581
(t = -1.071, p = 0.296)(t = 1.005, p = 0.328)
Daily.Deaths0.012-0.048
(t = 1.058, p = 0.302)(t = -0.486, p = 0.633)
Daily.Confirmed-0.0000.000
(t = -1.558, p = 0.134)(t = 0.745, p = 0.466)
Still.Hospitalised0.000-0.006
(t = 0.306, p = 0.763)(t = -1.257, p = 0.225)
Intensive.Care.Unit..ICU.0.009-0.171
(t = 1.793, p = 0.087)(t = -1.174, p = 0.256)
Daily.Confirmed:Intensive.Care.Unit..ICU.-0.000
(t = -0.933, p = 0.363)
Daily.Deaths:Intensive.Care.Unit..ICU.0.001
(t = 0.638, p = 0.531)
Still.Hospitalised:Intensive.Care.Unit..ICU.0.000
(t = 1.364, p = 0.189)
N2626
R20.2750.367
*** p < 0.001; ** p < 0.01; * p < 0.05.
plot_summs(m1, m2, 
           scale = T,
           plot.distributions = T,
           model.names = c("Main Effects Only", "with Interactions")) +
  theme(legend.position = "top")

sim_slopes(m2,
           pred = Intensive.Care.Unit..ICU., 
           modx = Daily.Deaths,
           johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths =  9.371494 (- 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.01     0.74   0.47
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 12.769231 (Mean): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.01     1.36   0.19
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 16.166968 (+ 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.01     1.30   0.21
sim_slopes(m2,
           pred = Daily.Deaths, 
           modx = Intensive.Care.Unit..ICU.,
           johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL 
## 
## The Johnson-Neyman interval could not be found. Is the p value for your
## interaction term below the specified alpha?
## 
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 53.02064 (- 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.00   0.02     0.16   0.88
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 62.15385 (Mean): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.01     1.00   0.33
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 71.28705 (+ 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.02   0.02     1.36   0.19

Interpretation of the Results

The regression analysis came up with two significant interaction terms.

First, it appears that the relationships between social comparisons orientation regarding abilities and life satisfaction is different depending on one’s gender.

Second, it appears that the relationships between social comparisons orientation regarding abilities and life satisfaction is different depending on one’s education.

From our data science project, we could find the following two findings:

  1. The first case of COVID-19 found in Singapore was confirmed on 23 January 2020, and since nationwide partial lockdown or circuit breaker kicked in from 7 April 2020 to 1 June 2020, Singapore have been experiencing policies change from time to time to effectively cope with the surges. As new norm has been rooted in local lifestyle, no apparent jump or trend has been observed in public sentiment. In fact, we further analyse the effects from the most recent key announcements, a sudden increase of number of tweets and jump in overall sentiment appears that there is a causation effect, but it did not drive any specific emotion class to form trending. Despite that social curb extension announced expected to cause negative sentiment, PM address has overwhelmed the effects on it and sentiment emerge with more trusts to the leadership. Thus, information delivering, create awareness and generate leads seems to be effective in producing positive public sentiment.

  2. It appears that the relationships between Daily Intensive Care Unit(ICU) cases regarding average sentiment is different depending on daily confirmed cases. Second, it appears that the relationships between social comparisons orientation regarding average sentiment is different depending on daily death cases

Implications

Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you). This is where you chime in and proactively discuss the meaning of the results.”

Limitations and Future Directions

The public sentiment reflected and analysed by tweets data especially effective among younger age group who spend most of their time in social media and carefully tweets what is in their mind. This can serve only a small sample of a whole local population. If possible, we also need to filter out news publisher tweets and focus on individual tweets. In here, We are comparing sentiment change within group based on causal inference driven by nation-wide speech and announcements. On how COVID-19 has changed the sentiment in the society, we have to identify and aware of timeline on key events happened and also view from a longer term from pre-covid and covid era, hence a relative longer time period of data will provide more insights for analysing the differences. We are hold back by the free Twitter developer account which eligible to extract tweets up to 7 days in the past.

Local slang words or Singlish are casually and frequently used among local community and most of them could be a good gauge of emotion, E.g. even a dialect vulgar should reflect anger, thus enriching the emotion dictionary like NRC can better tune to understand the positive and negative in sentiment.

Feature Importance describe which features are relevant and is another aspect we should explore to help us better understanding of solved problem and sometimes lead to model improvements for accuracy by employing feature selection.

References

[1] Julia Silge & David Robinson Text Mining with R - A TIDY APPROACH O’Reilly (2017)

[2] Andrea Cirillo R Data Mining - Implement data mining techniques through practical use cases and real-world datasets Packt> (2017)

[3] Tony Carilli R Companion to Real Econometrics February 2021 https://bookdown.org/carillitony/bailey/chp11.html

[4] Ashwin Malshe(2019-06-25) Data Analytics Applications https://ashgreat.github.io/analyticsAppBook/collect-tweets.html

[5]Hui Xiang Chua Covid-19 Singapore https://data.world/hxchua/covid-19-singapore

[6]Singapore Public Data COVID-19 case numbers https://data.gov.sg/dataset/covid-19-case-numbers

Appendix

Custom Stop-words for Text Pre-processing in Word Cloud data overview

“singapore”, “covid”, “covid19”,“positive”,“negative”,“oct”,“nov”,“news”,“amp”,“reuters”,“news”,“daily”, “malaysia”,“november”,“october”,“october”,“press”,“journal”,“amid”,“weekly”,“days”,“weeks”,“china”, “chinese”,“report”,“am”,“pm”,“dont”,“taking”,“found”,“morning”,“bloomberg”,“months”,“month”,“india”, “media”,“week”,“read”,“reports”,“data”,“europe”,“monday”,“tuesday”,“wednesday”,“thursday”,“friday”, “satursday”,“sunday”,“wall”,“street”

The objective is to clean those are less relevant and very little meaning to find sentiment, such as punctuation, special character, prefix with number, hashtag, alias, links and custom terms above.

We removed duplicated text in tweets, sent from the same screen name multiple times. For instance, there are several news publishers have posted the same tweet on different days.

sessionInfo()

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] interactions_1.1.5 kableExtra_1.3.4   knitr_1.33         ggfortify_0.4.12  
##  [5] DT_0.19            caret_6.0-90       gifski_1.4.3-1     quantmod_0.4.18   
##  [9] TTR_0.24.2         xts_0.12.1         zoo_1.8-9          glue_1.4.2        
## [13] rtweet_0.7.0.9000  gtrendsR_1.4.8     gvlma_1.0.0.3      ggstance_0.3.5    
## [17] scales_1.1.1       ggthemes_4.2.4     sandwich_3.0-1     car_3.0-11        
## [21] carData_3.0-4      Hmisc_4.5-0        Formula_1.2-4      survival_3.2-13   
## [25] lattice_0.20-44    psych_2.1.9        gridExtra_2.3      huxtable_5.4.0    
## [29] jtools_2.1.4       textdata_0.4.1     reshape2_1.4.4     wordcloud_2.6     
## [33] RColorBrewer_1.1-2 wordcloud2_0.2.1   tidytext_0.3.1     lubridate_1.7.10  
## [37] modelr_0.1.8       broom_0.7.9        forcats_0.5.1      stringr_1.4.0     
## [41] dplyr_1.0.7        purrr_0.3.4        readr_2.0.1        tidyr_1.1.4       
## [45] tibble_3.1.5       ggplot2_3.3.5      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         backports_1.2.1      systemfonts_1.0.2   
##   [4] plyr_1.8.6           splines_4.1.0        crosstalk_1.1.1     
##   [7] listenv_0.8.0        SnowballC_0.7.0      digest_0.6.28       
##  [10] foreach_1.5.1        htmltools_0.5.2      fansi_0.5.0         
##  [13] magrittr_2.0.1       checkmate_2.0.0      cluster_2.1.2       
##  [16] tzdb_0.1.2           openxlsx_4.2.4       recipes_0.1.17      
##  [19] globals_0.14.0       gower_0.2.2          svglite_2.0.0       
##  [22] jpeg_0.1-9           colorspace_2.0-2     rappdirs_0.3.3      
##  [25] rvest_1.0.1          haven_2.4.3          xfun_0.25           
##  [28] crayon_1.4.2         jsonlite_1.7.2       iterators_1.0.13    
##  [31] gtable_0.3.0         ipred_0.9-12         webshot_0.5.2       
##  [34] future.apply_1.8.1   abind_1.4-5          DBI_1.1.1           
##  [37] Rcpp_1.0.7           viridisLite_0.4.0    htmlTable_2.2.1     
##  [40] tmvnsim_1.0-2        foreign_0.8-81       radarchart_0.3.1    
##  [43] stats4_4.1.0         lava_1.6.10          prodlim_2019.11.13  
##  [46] htmlwidgets_1.5.4    httr_1.4.2           ellipsis_0.3.2      
##  [49] farver_2.1.0         pkgconfig_2.0.3      nnet_7.3-16         
##  [52] sass_0.4.0           dbplyr_2.1.1         utf8_1.2.2          
##  [55] labeling_0.4.2       tidyselect_1.1.1     rlang_0.4.12        
##  [58] munsell_0.5.0        cellranger_1.1.0     tools_4.1.0         
##  [61] cli_3.1.0            generics_0.1.1       pacman_0.5.1        
##  [64] evaluate_0.14        fastmap_1.1.0        yaml_2.2.1          
##  [67] ModelMetrics_1.2.2.2 fs_1.5.0             zip_2.2.0           
##  [70] pander_0.6.4         future_1.23.0        nlme_3.1-152        
##  [73] xml2_1.3.2           tokenizers_0.2.1     compiler_4.1.0      
##  [76] rstudioapi_0.13      curl_4.3.2           png_0.1-7           
##  [79] reprex_2.0.1         bslib_0.2.5.1        stringi_1.7.4       
##  [82] highr_0.9            Matrix_1.3-4         commonmark_1.7      
##  [85] vctrs_0.3.8          pillar_1.6.4         lifecycle_1.0.1     
##  [88] jquerylib_0.1.4      data.table_1.14.2    R6_2.5.1            
##  [91] latticeExtra_0.6-29  rio_0.5.27           parallelly_1.29.0   
##  [94] janeaustenr_0.1.5    codetools_0.2-18     MASS_7.3-54         
##  [97] assertthat_0.2.1     withr_2.4.2          mnormt_2.0.2        
## [100] broom.mixed_0.2.7    mgcv_1.8-36          parallel_4.1.0      
## [103] hms_1.1.1            grid_4.1.0           rpart_4.1-15        
## [106] timeDate_3043.102    class_7.3-19         rmarkdown_2.10      
## [109] pROC_1.18.0          base64enc_0.1-3

Contribution Statement

Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”